OCPBUGS-49764: bindata/alerts/slo: improve burnrate calculation #1744

dgrisonnet · 2024-09-26T16:06:00Z

The problem that I recently noticed with the existing expression is that when we compute the overall burnrate from write and read requests, we take the ratio of failed read requests and add it to the ratio of failed write requests. But each of these ratios is calculated against its own request type, not the total number of requests. This is only correct when the proportion of write and read requests is equal.

For example, take 10 requests during a disruption: 4 are writes (40%), of which only 50% succeed, and 6 are reads, of which roughly 90% succeed (rounded here to 1 failure out of 6).

apiserver_request:burnrate1h{verb="write"} would be equal to 2/4 and apiserver_request:burnrate1h{verb="read"} would be 1/6.
The sum of these, as used by the alert today, would be equal to 2/4+1/6=2/3, when in reality the ratio of failed requests should be 2/10+1/10=3/10. So there is quite a huge difference today when we don't account for the total number of requests.
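
The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch: the counts (4 writes with 2 failures, 6 reads with 1 failure) are read off the fractions 2/4 and 1/6 above, and all names are illustrative rather than taken from the alert rules.

```python
from fractions import Fraction

# Hypothetical counts matching the example: 10 requests in total.
writes_total, writes_failed = 4, 2
reads_total, reads_failed = 6, 1

# Per-type error ratios, each measured against its own request type.
write_ratio = Fraction(writes_failed, writes_total)  # 1/2
read_ratio = Fraction(reads_failed, reads_total)     # 1/6

# Summing the per-type ratios, as the alert expression is described as doing.
summed_ratio = write_ratio + read_ratio

# Error ratio measured against all requests.
overall_ratio = Fraction(writes_failed + reads_failed,
                         writes_total + reads_total)

print(summed_ratio)   # 2/3
print(overall_ratio)  # 3/10
```

Using `Fraction` keeps the results exact, so they can be compared directly with the fractions quoted in the description.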

vrutkovs · 2024-09-26T17:49:33Z

/cc

vrutkovs · 2024-10-14T14:16:53Z

That makes sense to me. Other burnrates (burnrate6h etc.) should be updated as well.

openshift-bot · 2025-01-13T01:00:43Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Calculate the request burn rate based on the total number of read+write requests instead of separately calculating the burn rate for each request type. This used to cause an erroneous result when summing the read and write burn rates, as it wouldn't account for the proportion of failures amongst all requests.

Signed-off-by: Damien Grisonnet <[email protected]>
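
One way to read this commit message is that the error ratio over all requests is the per-type ratios weighted by each type's share of the traffic. The symbols below are notation introduced here for illustration ($e$ for failed requests, $n$ for total requests, subscripts $w$ and $r$ for write and read); they do not come from the rule files, and the sketch ignores any other terms the real recording rules may include.

```latex
\[
\frac{e_w + e_r}{n_w + n_r}
  = \frac{n_w}{n_w + n_r}\cdot\frac{e_w}{n_w}
  + \frac{n_r}{n_w + n_r}\cdot\frac{e_r}{n_r}
\]
% With the numbers from the PR description (4 writes, 2 failed; 6 reads, 1 failed):
\[
\frac{4}{10}\cdot\frac{2}{4} + \frac{6}{10}\cdot\frac{1}{6}
  = \frac{2}{10} + \frac{1}{10}
  = \frac{3}{10}
\]
```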

dgrisonnet · 2025-02-03T09:11:12Z

/remove-lifecycle stale

dgrisonnet · 2025-02-03T09:11:22Z

/retest-required

openshift-ci · 2025-02-03T13:26:11Z

@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-aws-ovn | `275f05d` | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/e2e-gcp-operator-single-node | `275f05d` | link | false | `/test e2e-gcp-operator-single-node` |
| ci/prow/e2e-aws-operator-disruptive-single-node | `275f05d` | link | false | `/test e2e-aws-operator-disruptive-single-node` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

vrutkovs

/lgtm

openshift-ci · 2025-02-03T14:15:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, vrutkovs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dgrisonnet,vrutkovs]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-02-03T14:56:26Z

@dgrisonnet: This pull request references Jira Issue OCPBUGS-49764, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.19.0) matches configured target version for branch (4.19.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

dgrisonnet · 2025-02-03T14:56:31Z

/jira refresh
/retest-required

openshift-ci-robot · 2025-02-03T14:56:36Z

@dgrisonnet: This pull request references Jira Issue OCPBUGS-49764, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.19.0) matches configured target version for branch (4.19.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

dgrisonnet · 2025-02-05T13:43:22Z

/label acknowledge-critical-fixes-only

dgrisonnet · 2025-02-05T13:47:02Z

/cherry-pick release-4.18

openshift-cherrypick-robot · 2025-02-05T13:47:05Z

@dgrisonnet: once the present PR merges, I will cherry-pick it on top of release-4.18 in a new PR and assign it to you.

openshift-ci-robot · 2025-02-05T16:31:33Z

/retest-required

Remaining retests: 0 against base HEAD 1537626 and 2 for PR HEAD 275f05d in total

dgrisonnet · 2025-02-05T18:49:56Z

/retest-required

openshift-ci-robot · 2025-02-05T18:57:38Z

@dgrisonnet: Jira Issue OCPBUGS-49764: All pull requests linked via external trackers have merged:

openshift/cluster-kube-apiserver-operator#1744

Jira Issue OCPBUGS-49764 has been moved to the MODIFIED state.

openshift-cherrypick-robot · 2025-02-05T18:58:26Z

@dgrisonnet: new pull request created: #1797

openshift-bot · 2025-02-05T22:08:59Z

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-kube-apiserver-operator
This PR has been included in build ose-cluster-kube-apiserver-operator-container-v4.19.0-202502052138.p0.gf90c0d9.assembly.stream.el9.
All builds following this will include this PR.

openshift-ci bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Sep 26, 2024

openshift-ci bot requested review from benluddy and deads2k September 26, 2024 16:11

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Sep 26, 2024

openshift-ci bot requested a review from vrutkovs September 26, 2024 17:49

openshift-ci bot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) Jan 13, 2025

dgrisonnet force-pushed the improve-burnrate branch from 56d01d8 to 275f05d January 31, 2025 14:24

dgrisonnet changed the title ~~WIP: bindata/alerts/slo: improve burnrate calculation~~ bindata/alerts/slo: improve burnrate calculation Jan 31, 2025

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 31, 2025

openshift-ci bot removed the lifecycle/stale label Feb 3, 2025

vrutkovs approved these changes Feb 3, 2025

openshift-ci bot assigned vrutkovs Feb 3, 2025

openshift-ci bot added the lgtm label (indicates that a PR is ready to be merged) Feb 3, 2025

dgrisonnet changed the title ~~bindata/alerts/slo: improve burnrate calculation~~ OCPBUGS-49764: bindata/alerts/slo: improve burnrate calculation Feb 3, 2025

openshift-ci-robot added the jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) and jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting) labels Feb 3, 2025

openshift-ci bot requested a review from wangke19 February 3, 2025 14:56

openshift-ci bot added the acknowledge-critical-fixes-only label (indicates if the issuer of the label is OK with the policy) Feb 5, 2025

openshift-merge-bot bot merged commit f90c0d9 into openshift:master Feb 5, 2025
13 of 16 checks passed

openshift-cherrypick-robot mentioned this pull request Feb 5, 2025

[release-4.18] OCPBUGS-49898: bindata/alerts/slo: improve burnrate calculation #1797

Open
